The following tutorial will let you reproduce the plots that we created in the lecture using R.
Please read carefully and follow the steps. Wherever you see the Code icon on the right you can click on it to see the actual code used in that section (see a simple example below this paragraph). You are more than welcome to try it yourself before checking the code, but there’s also the option to show or hide all code blocks at the very top right of this document.
Good luck!
## [1] "Hello World!"
R is a programming language and free software environment for statistical computing and graphics supported by the R Foundation for Statistical Computing. The R language is widely used among statisticians and data miners for developing statistical software and data analysis.
RStudio is a set of integrated tools designed to help you be more productive with R. It includes a console, a syntax-highlighting editor that supports direct code execution, and a variety of robust tools for plotting, viewing history, debugging and managing your workspace. RStudio requires R to be installed first, so that it can send commands to the R interpreter.
If we want to keep things simple (for this course), or we would like to use R on shared computers where we can’t install software, we can run R and RStudio through a web client that is hosted on a remote server.
We will use the Binder service, which is free, easy to use and can be launched from a single GitHub repository (more about this in the workshop).
See the appendix for more details on how you can run R and RStudio on EcoCloud, another cloud-based service that is free for Griffith students.
Using Binder is as simple as clicking on the Binder badge.
Alternatively, you can navigate to the GitHub repository of this tutorial (or any other Binder-ready project) and click on the Binder badge.
You should now see an RStudio interface in your web browser and are ready to start working in R in “The Cloud”!
Both R and RStudio can be installed locally on any operating system (see a detailed tutorial), which provides complete control over the installation and the added packages, and lets us work anywhere without requiring an internet connection.
Regardless of whether we installed R and RStudio locally or we use the Binder service, we interact with R through the RStudio integrated development environment (IDE), which lets us easily write our code, test it, and see our files, the objects in memory and the plots that we produce. If we run the analysis locally, it is highly recommended to use RStudio’s built-in Projects to contain our analysis in its own folder with all the files required. That will also help in reading data files and writing results and figures back to the hard drive.
- Start RStudio by clicking on its icon.
- Start a new project by selecting “File –> New Project” or clicking on the “New Project” icon (under “Edit” in the taskbar).
- Select “New Directory –> New Project”, then enter “Workshop1” in the Directory name text box and browse to the “workspace” folder to create the project folder in (see screenshots A-D in Figure 3 below)
Figure 3: Create a new project in RStudio screenshots.
- Create a new R script file by selecting “File –> New File –> R Script” or clicking on the “New File” icon (under the “File” in the taskbar)
- Save the script file by selecting “File –> Save”, pressing Ctrl+s or clicking on the floppy disk icon on the top bar
R can be extended with additional functionality by installing external packages (usually hosted at the Comprehensive R Archive Network repository – CRAN). To find which packages can be useful for your type of analysis, use a search engine (Google is your friend) and the available Task Views on CRAN, which provide some guidance on which CRAN packages are relevant for tasks related to a certain topic.
For our current analysis we will use some packages from the tidyverse – a suite of packages designed to assist in data analysis, from reading data from multiple sources (readr, readxl packages), through data wrangling and cleanup (such as dplyr, tidyr) and finally visualisation (ggplot2), as can be seen in Figure 4.
If tidyverse fails to install, then try bplyr.
Figure 4: An example of a data analysis workflow using packages from the Tidyverse (credit to The Centre for Statistics in Ecology, the Environment and Conservation, University of Cape Town).
These packages are already pre-installed in Binder, but they will need to be installed if you chose to run the analysis locally.
To install these packages, we use the install.packages('package') command. Please note that the package name needs to be quoted and that we only need to perform the installation once, or when we want or need to update the package. Once the package is installed, we can load its functions using the library(package) command. Note that in this case we can use the package name without quotes!
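The install-once, load-every-session pattern can be sketched as below. The helper name `missing_packages()` is hypothetical (it is not part of R or the tidyverse); it only reports which packages are not installed yet, so that install.packages() is run when it is actually needed.

```r
# Hypothetical helper: return the names of the packages that are not installed yet
missing_packages <- function(pkgs) {
  installed <- vapply(pkgs, requireNamespace, logical(1), quietly = TRUE)
  pkgs[!installed]
}

to_install <- missing_packages(c("tidyverse", "here"))
if (length(to_install) > 0) {
  message("Packages still to install: ", toString(to_install))
}

# Installing uses quoted names (run once):
# install.packages(to_install)
# Loading can use the unquoted name (run in every new R session):
# library(tidyverse)
```

Checking first avoids re-installing packages every time the script is run.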
# install required packages - needed only once! (comment with a # after first use)
install.packages("tidyverse")
install.packages("here")
# It seems like there is a problem with some versions of the `tibble` package, which we can overcome by installing the most recent development version
install.packages("remotes")
remotes::install_github("tidyverse/tibble")

Now that we’ve got RStudio up and running and our packages installed, we can load them and read data into R from our local computer or from web locations using dedicated functions specific to the file type (.csv, .txt, .xlsx, etc.).
We need to provide a variable name that will store the data, and we should remember that R holds its variables in the computer’s RAM (which will be the limiting factor in terms of the size of data that can be handled).
We will use the read_csv() command/function from the readr package (part of the tidyverse) to load the data from a file hosted on the web into a variable of type data frame (table).
# load required packages
library(tidyverse)
# Read data straight from the web
gapminder <- read_csv("https://tinyurl.com/gapdata")

You can also read the data from a local folder (in this case the file gapminder.csv in the data folder).
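A minimal sketch of reading a local file is shown here. It assumes nothing about your folders: a two-row sample file is first written to a temporary `data` folder (an assumption made only so the example runs anywhere), and the commented last line shows how the call might look inside an RStudio project that contains data/gapminder.csv.

```r
library(readr)

# Stand-in for the project's data folder (assumption: a temporary folder)
data_dir <- file.path(tempdir(), "data")
dir.create(data_dir, showWarnings = FALSE)

# Write a two-row sample file so that there is something to read
sample_file <- file.path(data_dir, "gapminder.csv")
writeLines(c("country,year,lifeExp",
             "Afghanistan,1952,28.8",
             "Afghanistan,1957,30.3"), sample_file)

# Read the local file into a data frame
gapminder_local <- read_csv(sample_file)

# Inside an RStudio project with the real file, the call could look like:
# gapminder <- read_csv(here::here("data/gapminder.csv"))
```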
We can explore the data by clicking on the table icon next to the variable name in the RStudio “Environment” tab (top right pane), but that’s not good practice because it won’t work well with large data sets. Instead, we can use built-in functions for a brief exploration, such as head() to show the first 6 rows of the data, str() for the type of data in each column and summary() for summary statistics of each column:
## # A tibble: 6 x 6
## country year pop continent lifeExp gdpPercap
## <chr> <dbl> <dbl> <chr> <dbl> <dbl>
## 1 Afghanistan 1952 8425333 Asia 28.8 779.
## 2 Afghanistan 1957 9240934 Asia 30.3 821.
## 3 Afghanistan 1962 10267083 Asia 32.0 853.
## 4 Afghanistan 1967 11537966 Asia 34.0 836.
## 5 Afghanistan 1972 13079460 Asia 36.1 740.
## 6 Afghanistan 1977 14880372 Asia 38.4 786.
## country year pop continent
## Length:1704 Min. :1952 Min. :6.001e+04 Length:1704
## Class :character 1st Qu.:1966 1st Qu.:2.794e+06 Class :character
## Mode :character Median :1980 Median :7.024e+06 Mode :character
## Mean :1980 Mean :2.960e+07
## 3rd Qu.:1993 3rd Qu.:1.959e+07
## Max. :2007 Max. :1.319e+09
## lifeExp gdpPercap
## Min. :23.60 Min. : 241.2
## 1st Qu.:48.20 1st Qu.: 1202.1
## Median :60.71 Median : 3531.8
## Mean :59.47 Mean : 7215.3
## 3rd Qu.:70.85 3rd Qu.: 9325.5
## Max. :82.60 Max. :113523.1
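Output like the above comes from the exploration functions mentioned earlier. A small sketch on a made-up data frame (the `gap_sample` name and its values are placeholders, not the real gapminder data):

```r
# A small made-up data frame standing in for `gapminder`
gap_sample <- data.frame(
  country = rep(c("A", "B"), each = 4),
  year    = rep(c(1952, 1957, 1962, 1967), times = 2),
  lifeExp = c(30, 32, 34, 36, 60, 62, 64, 66)
)

head(gap_sample)         # first 6 rows (the default)
head(gap_sample, n = 3)  # first 3 rows
str(gap_sample)          # type of each column and a preview of its values
summary(gap_sample)      # summary statistics for each column
```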
We will use the ggplot2 package, which stands for “Grammar of Graphics” and breaks up graphs into semantic components such as scales and layers that can be added to the plotting area, as can be seen in Figure 11.
There are lots of online tutorials on the use of ggplot2; one of my favourites is Beautiful plotting in R: A ggplot2 cheatsheet.
Figure 11: A visualisation of the layer concept in ‘ggplot2’ package (starting from bottom up, credit to Coding Club).
In this exercise we will plot the life expectancy throughout the years for the first 12 Asian countries (alphabetically) as a line graph. We use the %>% notation (which can be easily entered with the Ctrl+Shift+m keyboard shortcut) to “pipe” our data from one processing step to another without having to save it in intermediate variables. In this example we take the gapminder data frame, filter it by continent=="Asia" (notice the double == sign and that the match is case-sensitive!), then select the country column, identify the unique country names with distinct(), select the first 12 countries with the slice(start:end) indexing notation and finally extract the country column as a vector. We then create a subset of the data using these countries as our reference to match against when we filter by country name (with the %in% notation).
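To see what the pipe does, the sketch below runs the same steps twice on a made-up data frame, once as nested function calls and once piped. The `gap_sample` data are placeholders, and pull() is used as the dplyr way of extracting a column as a vector.

```r
library(dplyr)

# Made-up data standing in for `gapminder`
gap_sample <- data.frame(
  country   = c("A", "A", "B", "B", "C", "C"),
  continent = c("Asia", "Asia", "Asia", "Asia", "Europe", "Europe"),
  year      = c(1952, 1957, 1952, 1957, 1952, 1957)
)

# Nested calls: have to be read from the inside out
nested <- pull(
  slice(distinct(select(filter(gap_sample, continent == "Asia"), country)), 1:2),
  country
)

# The same steps with the pipe: read from top to bottom
piped <- gap_sample %>%
  filter(continent == "Asia") %>%
  select(country) %>%
  distinct() %>%
  slice(1:2) %>%
  pull(country)

# Both approaches give the same vector of country names
identical(nested, piped)
```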
Once we have a subset of the data we can use it as input for plotting with ggplot() function. We need to specify to the function which data to operate on and how to map plotting features (such as X and Y axes). We can (and will) map other plotting features to variables (columns) in our data later on.
Let’s see what we’re getting when running the following code:
# create a vector of the first 12 Asian countries
first12_Asian_countries <- gapminder %>% filter(continent=="Asia") %>%
select(country) %>% distinct() %>%
slice(1:12) %>% .$country
# use the vector to subset the data to include only these countries
gap_asia <- gapminder %>% filter(continent=="Asia", country %in% first12_Asian_countries)
# create the plot
ggplot(data = gap_asia, mapping = aes(x = year, y = lifeExp))

Figure 5: Life expectancy by years in Asian countries (first try).
We got an empty canvas in Figure 5, but it has been sized to fit the range of our data on the X and Y axes. That will be the first layer of our plot (like the bottom one in Figure 11).
Now, let’s use the + sign to add additional layers to the plotting canvas, starting with the type of graph we want to plot (in this case a line graph), which can be achieved by adding a geom_line() function, which stands for “line geometry” (in a similar way we can add other plotting geometries, such as geom_point(), geom_bar(), etc.).
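A minimal sketch of adding the line layer, using a small made-up data frame (`gap_sample`, a placeholder for the `gap_asia` subset) so that it can be run on its own:

```r
library(ggplot2)

# Made-up data standing in for `gap_asia`
gap_sample <- data.frame(
  country = rep(c("A", "B"), each = 3),
  year    = rep(c(1952, 1957, 1962), times = 2),
  lifeExp = c(30, 32, 34, 60, 62, 64)
)

# Same canvas as before, with a line geometry added as a new layer
p_line <- ggplot(data = gap_sample, mapping = aes(x = year, y = lifeExp)) +
  geom_line()
p_line
```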
Figure 6: Life expectancy by years in Asian countries (added line graph).
What just happened (Figure 6)? Can you guess?
ggplot2 is just doing what we asked it to do: it plots all the year-lifeExp combinations and connects them with a single line, regardless of which country the data came from. What we actually want is to connect the dots of each country separately for the graph to make sense!
Let’s assign a specific colour (of lines/bars/markers) to each country in our data. This is done by mapping the colour in aes(), which can go either in the initial aes() call within the ggplot() function, or be added later inside the particular geometry we’re adding (inside geom_line() in this case). ggplot2 is smart enough to know that if we map a colour to each country, then the data from each country should be grouped and plotted as a separate line.
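As a sketch of the second option, the colour mapping can be placed inside the geometry instead of in ggplot(). The example also shows the `group` aesthetic, which separates the lines without colouring them. The `gap_sample` data are made up for illustration.

```r
library(ggplot2)

# Made-up data standing in for `gap_asia`
gap_sample <- data.frame(
  country = rep(c("A", "B"), each = 3),
  year    = rep(c(1952, 1957, 1962), times = 2),
  lifeExp = c(30, 32, 34, 60, 62, 64)
)

# Colour mapped inside the geometry rather than in ggplot()
p_colour <- ggplot(gap_sample, aes(x = year, y = lifeExp)) +
  geom_line(aes(color = country))

# Grouping only: one line per country, all drawn in the same colour
p_group <- ggplot(gap_sample, aes(x = year, y = lifeExp, group = country)) +
  geom_line()
```

A mapping placed in ggplot() is inherited by every layer, while a mapping inside a geometry applies to that layer only.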
# create the plot
ggplot(data = gap_asia, mapping = aes(x = year, y = lifeExp, color=country)) +
geom_line(size=1)

Figure 7: Life expectancy by years in Asian countries (added line graph coloured by country).
Hooray! This looks much better (Figure 7).
We just need a few more final touches to make it “publication-ready” (in the same order their layers are added below):
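- Apply a cleaner theme with a larger base font size (theme_bw(14)).
- Use a custom colour palette (scale_color_brewer(), which takes its palettes from the RColorBrewer package that is installed along with the tidyverse).
- Set the X axis breaks to every 10 years (scale_x_continuous()).
- Add a title, axis labels and a legend title (labs(); the legend title is set through the color aesthetic).

The final result can be seen in Figure 8.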
ggplot(data = gap_asia, mapping = aes(x = year, y = lifeExp, color=country)) +
geom_line(size=1) + theme_bw(14) + # geom_smooth(method="lm", size=1.5) +
scale_color_brewer(palette = "Paired") +
scale_x_continuous(breaks = seq(min(gap_asia$year), max(gap_asia$year), by = 10)) +
labs(title = "Life expectancy by years in Asian countries", x = "Year", y = "Life Expectancy (years)", color="Country")

Figure 8: Life expectancy by years in Asian countries (beautify the plot with themes, colour palettes and labels).
We would like to save it to a sub-folder, following our “best practice” rule of separating raw data from output files and from analysis scripts. We can create a sub-folder named output using the file explorer in Windows or the one built into RStudio (first tab in the bottom right pane). We can of course do it with an R command dir.create() as demonstrated below.
Once the output folder is created, we use the ggsave() function to save our current plot.
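These two steps can be sketched as follows. To keep the example runnable anywhere, the output folder is created under a temporary folder (an assumption), and the plot, data and file name are placeholders. In the workshop project the folder would be here::here("output").

```r
library(ggplot2)

# Stand-in for the project's output folder (assumption: a temporary folder)
out_dir <- file.path(tempdir(), "output")
dir.create(out_dir, showWarnings = FALSE)

# A placeholder plot built from made-up data
gap_sample <- data.frame(
  country = rep(c("A", "B"), each = 3),
  year    = rep(c(1952, 1957, 1962), times = 2),
  lifeExp = c(30, 32, 34, 60, 62, 64)
)
p <- ggplot(gap_sample, aes(x = year, y = lifeExp, color = country)) +
  geom_line()

# Save the plot; the file type is taken from the file extension
out_file <- file.path(out_dir, "lifeExp_sample.pdf")  # hypothetical file name
ggsave(out_file, plot = p, width = 10, height = 8)
file.exists(out_file)
```

If the `plot` argument is left out, ggsave() saves the last plot that was displayed.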
This time we’ll look at the relationship between GDP per capita and life expectancy using geom_point() for an X-Y scatter plot, and we’ll colour the points by continent. We’ll apply the theme, custom colour palette and labels straight away.
# create plot
ggplot(data = gapminder, mapping = aes(x = gdpPercap, y = lifeExp, color=continent)) +
geom_point(size=3.5) + theme_bw(14) + # geom_smooth(method="lm", size=1.5) +
scale_color_brewer(palette = "Set1") +
labs(title = "GDP per capita by life expectancy", x = "GDP per capita", y = "life expectancy", color="Continent")

Figure 9: Relationship between GDP per capita and life expectancy by continent
As we can see in Figure 9, the markers overlay each other, so it’s hard to tell where there’s a high density of data points and whether they cover others behind them. To fix this, we’ll make the points semi-transparent using the alpha=0.5 argument in geom_point().
Another issue is that the relationship seems exponential, which makes it hard to see a trend and interpret the results. We can log-transform the data (which variable, gdpPercap or lifeExp?) to see if it will linearise, or we can use a visualisation “trick” and log-transform just the axis.
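The two options can be sketched side by side on made-up numbers (the `gap_sample` values are placeholders). Both place the points at the same positions; the difference is in what the X axis labels show.

```r
library(ggplot2)

# Made-up data standing in for `gapminder`
gap_sample <- data.frame(
  gdpPercap = c(500, 1000, 5000, 10000, 50000),
  lifeExp   = c(40, 50, 60, 70, 80)
)

# Option 1: transform the data; the axis is labelled in log10 units
p_data <- ggplot(gap_sample, aes(x = log10(gdpPercap), y = lifeExp)) +
  geom_point()

# Option 2: transform only the axis; the labels stay in the original units
p_axis <- ggplot(gap_sample, aes(x = gdpPercap, y = lifeExp)) +
  geom_point() +
  scale_x_log10()
```

Transforming only the axis keeps the labels in the original units, which is usually easier to read.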
Check out Figure 10, the plot looks much better! We’ll save this plot to the output folder as well.
# create plot
ggplot(data = gapminder, mapping = aes(x = gdpPercap, y = lifeExp, color=continent)) +
geom_point(alpha=0.5, size=3.5) + theme_bw(18) + # geom_smooth(method="lm", size=1.5) +
scale_x_log10() + scale_color_brewer(palette = "Set1") +
labs(title = "GDP per capita by life expectancy by continent", x = "GDP per capita", y = "life expectancy", color="Continent")

Figure 10: Relationship between GDP per capita and life expectancy by continent (log-scaled X-axis)
# save plot to file
ggsave(here::here("output/gdpPercap_vs_lifeExp_by_continent.pdf"), width=10, height=8)

Now for the hard part: what can we learn from the data? What other plots can we generate to help us understand trends in the data and gaps between countries?
We can discuss these in detail in class…
Please contact me at i.bar@griffith.edu.au for any questions or comments.
Figure 12: An example of a ‘ggplot2’ theme inspired by Game of Thrones (tvthemes package)
The EcoCloud platform provides a cloud-hosted RStudio service for all students and staff of participating Australian and NZ universities and government agencies (including Griffith University). A guide on using RStudio within EcoCloud is available here.
- Please navigate to EcoCloud and follow the prompts to login using your Griffith credentials (AAF login)
- In your dashboard, click on the orange “Launch notebook server” button in the middle of the screen and select “Rstudio notebook” in the popup window and click on the green “Launch” button (see screenshot in Figure 1 below)
Figure 1: EcoCloud dashboard screenshot.
- Once the server is running, click on the green “Open” button; a new browser tab will open with the JupyterLab dashboard
- Optional: whenever a new server is started on EcoCloud, all previous RStudio settings and installed packages get reset back to defaults. To overcome this behaviour and make it more user-friendly for long-term and recurring use, start a new terminal from the JupyterLab dashboard (see bottom of Figure 2) and type the following Bash (Linux) commands:
mkdir -p ~/workspace/.rstudio/library
echo ".libPaths('~/workspace/.rstudio/library')" >> .Rprofile
ln -s ~/workspace/.rstudio ~/

The last 2 commands need to be run every time the server restarts.
- Click on the RStudio logo in the JupyterLab dashboard (see screenshot in Figure 2 below)
Figure 2: Jupyter dashboard screenshot.
- You are now ready to start working in R in the cloud!!